  Apache Storm / STORM-1941

Nimbus discovery can fail when a ZooKeeper reconnect happens.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.0, 1.0.1
    • Fix Version/s: 2.0.0, 1.0.2, 1.1.0
    • Component/s: storm-core
    • Labels: None

    Description

      When a ZooKeeper reconnect happens, the Nimbus registry node can be deleted even though Nimbus is still alive.

      Below is the ZooKeeper node for the Nimbus registry, as returned by two successive get commands.

      get /storm/nimbuses/<host>:6627
      <binary data, 98 bytes>
      cZxid = 0x4000005ae
      ctime = Fri Jul 01 11:43:51 UTC 2016
      mZxid = 0x4000005ae
      mtime = Fri Jul 01 11:43:51 UTC 2016
      pZxid = 0x4000005ae
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x255a62e310c0005
      dataLength = 98
      numChildren = 0
      
      get /storm/nimbuses/<host>:6627
      <binary data, 98 bytes>
      cZxid = 0x4000005ae
      ctime = Fri Jul 01 11:43:51 UTC 2016
      mZxid = 0x50000000e
      mtime = Fri Jul 01 11:46:08 UTC 2016
      pZxid = 0x4000005ae
      cversion = 0
      dataVersion = 1
      aclVersion = 0
      ephemeralOwner = 0x255a62e310c0005
      dataLength = 98
      numChildren = 0
      

      Below is the transaction log for that node.

      7/1/16 11:43:51 AM UTC session 0x255a62e310c0005 cxid 0xd zxid 0x4000005ae create '/storm/nimbuses/<host>:6627,#1fffffff8b80000000ffffffe36660646060ffffff90ffffffcfffffffcaffffffc9ffffffccffffffd54dffffffcc2dffffffd62d2effffffc92fffffffcaffffffd535ffffffd2ffffffcb2f48ffffffcd2b2e494cffffffceffffffceffffffc94f4effffffccffffffe160606260ffffff907cffffffccffffffc1ffffffc01c5e165effffffceffffffc4ffffffc0ffffffc2ffffffc0ffffffcdffffffc0affffffd42768ffffffa867ffffffa067ffffffa867ffffffa467affffffa4d742dffffff8c2c1805b14ffffffc2ffffffaf51000,v{s{31,s{'world,'anyone}}},T,10
      
      7/1/16 11:46:08 AM UTC session 0x355a647bd8c0000 cxid 0x3 zxid 0x50000000e setData '/storm/nimbuses/<host>:6627,#1fffffff8b80000000ffffffe36660646060ffffff90ffffffcfffffffcaffffffc9ffffffccffffffd54dffffffcc2dffffffd62d2effffffc92fffffffcaffffffd535ffffffd2ffffffcb2f48ffffffcd2b2e494cffffffceffffffceffffffc94f4effffffccffffffe160606260ffffff907cffffffccffffffc1ffffffc01c5e165effffffceffffffc4ffffffc0ffffffc2ffffffc0ffffffcdffffffc0affffffd42768ffffffa867ffffffa067ffffffa867ffffffa467affffffa4d742dffffff8c2c1805b14ffffffc2ffffffaf51000,1
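      The zxids in this log can be decoded to see how the two writes relate. In ZooKeeper a zxid carries the leader epoch in its high 32 bits and a per-epoch counter in its low 32 bits. The small Java sketch below (ZxidDecode, epoch and counter are illustrative names, not part of ZooKeeper or Storm) splits both zxids: the create ran in epoch 4 and the setData in epoch 5, which is consistent with a new ZooKeeper leader having been elected between the two writes.

```java
// Illustrative helper (hypothetical names): splits a ZooKeeper zxid into
// its leader epoch (high 32 bits) and per-epoch counter (low 32 bits).
public class ZxidDecode {
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long create = 0x4000005aeL;  // zxid of the create transaction above
        long setData = 0x50000000eL; // zxid of the setData transaction above
        // prints "create:  epoch=4 counter=0x5ae"
        System.out.printf("create:  epoch=%d counter=0x%x%n", epoch(create), counter(create));
        // prints "setData: epoch=5 counter=0xe"
        System.out.printf("setData: epoch=%d counter=0x%x%n", epoch(setData), counter(setData));
    }
}
```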
      

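      Note the ctime, mtime, and ephemeralOwner values: the node was created at 11:43:51 and updated at 11:46:08, but its ephemeralOwner is still the original session 0x255a62e310c0005.
      That session had already been closed on the Nimbus side, as the client log below shows, but ZooKeeper may not delete its ephemeral node immediately. As a result, the new session 0x355a647bd8c0000 does not create a new node; it calls setData on the existing ephemeral node, which is still owned by the closed session.
      Eventually that node is deleted, even though session 0x355a647bd8c0000 is still alive.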

      2016-07-01 11:45:05.675 o.a.s.s.o.a.z.ClientCnxn [DEBUG] Disconnecting client for session: 0x255a62e310c0005
      2016-07-01 11:45:05.675 o.a.s.s.o.a.z.ZooKeeper [INFO] Session: 0x255a62e310c0005 closed
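      The failure sequence can be reproduced with a minimal sketch. It assumes a hypothetical in-memory model of ephemeral znodes (StaleEphemeralDemo, registerBuggy and expireSession are made-up names, not the ZooKeeper client API or Storm's code). It follows the register-or-update pattern described above: if the node still exists, only its data is updated, so its owner stays the old session, and expiring that session removes the registry entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of ephemeral znodes; this is NOT the ZooKeeper API.
public class StaleEphemeralDemo {
    // A znode's payload plus the session that owns it (its ephemeralOwner).
    static final class Node {
        final long ephemeralOwner;
        byte[] data;

        Node(long ephemeralOwner, byte[] data) {
            this.ephemeralOwner = ephemeralOwner;
            this.data = data;
        }
    }

    final Map<String, Node> nodes = new HashMap<>();

    void create(String path, byte[] data, long session) {
        if (nodes.containsKey(path)) {
            throw new IllegalStateException("NodeExists: " + path);
        }
        nodes.put(path, new Node(session, data));
    }

    // Like ZooKeeper's setData, this replaces the payload but never the owner.
    void setData(String path, byte[] data) {
        Node n = nodes.get(path);
        if (n == null) {
            throw new IllegalStateException("NoNode: " + path);
        }
        n.data = data;
    }

    // When the server finally ends a session, every node it owns is removed.
    void expireSession(long session) {
        nodes.values().removeIf(n -> n.ephemeralOwner == session);
    }

    // Register-or-update: reuse the node if it still exists, otherwise create it.
    void registerBuggy(String path, byte[] data, long session) {
        if (nodes.containsKey(path)) {
            setData(path, data);
        } else {
            create(path, data, session);
        }
    }

    public static void main(String[] args) {
        String path = "/storm/nimbuses/<host>:6627";
        long oldSession = 0x255a62e310c0005L;
        long newSession = 0x355a647bd8c0000L;

        StaleEphemeralDemo zk = new StaleEphemeralDemo();
        zk.create(path, new byte[] {1}, oldSession);
        // The old session was closed on the client side, but the server has
        // not removed its node yet, so the new session only updates it.
        zk.registerBuggy(path, new byte[] {2}, newSession);
        zk.expireSession(oldSession);
        System.out.println("registered: " + zk.nodes.containsKey(path)); // prints "registered: false"
    }
}
```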
      

      We can fix this by having the reconnect event handler delete the node first and then create the ephemeral node again, so that it is owned by the current session.
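      A minimal sketch of that reconnect handling, again using a hypothetical in-memory model rather than the real ZooKeeper client (ReRegisterOnReconnect, onReconnect and the other names are made up for illustration, and node payloads are omitted). Deleting first guarantees that the following create makes the current session the ephemeral owner, so expiring the old session no longer removes the registry entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of ephemeral znodes (not the ZooKeeper API),
// used to illustrate "delete, then create" registration on reconnect.
public class ReRegisterOnReconnect {
    // path -> owning session id (the node's ephemeralOwner)
    final Map<String, Long> owners = new HashMap<>();

    void create(String path, long session) {
        if (owners.putIfAbsent(path, session) != null) {
            throw new IllegalStateException("NodeExists: " + path);
        }
    }

    // No-op if the node is absent, similar to ignoring a NoNode error.
    void delete(String path) {
        owners.remove(path);
    }

    // When the server finally ends a session, every node it owns is removed.
    void expireSession(long session) {
        owners.values().removeIf(owner -> owner == session);
    }

    // Reconnect handler: drop whatever node is there, then recreate it
    // so that the current session becomes its ephemeral owner.
    void onReconnect(String path, long currentSession) {
        delete(path);
        create(path, currentSession);
    }

    public static void main(String[] args) {
        String path = "/storm/nimbuses/<host>:6627";
        long oldSession = 0x255a62e310c0005L;
        long newSession = 0x355a647bd8c0000L;

        ReRegisterOnReconnect zk = new ReRegisterOnReconnect();
        zk.create(path, oldSession);
        zk.onReconnect(path, newSession);
        zk.expireSession(oldSession);
        System.out.println("registered: " + zk.owners.containsKey(path)); // prints "registered: true"
        System.out.println("owner is new session: " + (zk.owners.get(path) == newSession)); // prints "owner is new session: true"
    }
}
```

      Against a real ensemble the delete and the create are two separate requests, so an actual handler would also need to tolerate KeeperException.NoNodeException from the delete and KeeperException.NodeExistsException from the create if the node reappears in between.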

          People

            Assignee: Jungtaek Lim (kabhwan)
            Reporter: Jungtaek Lim (kabhwan)
            Votes: 0
            Watchers: 3
